Goals

* Give an overview of the GPU ISA
* Understand why some programs run faster than others
* Prepare to read the official documentation from AMD


TU Dresden, 09.11.11 R600 ISA Slide 2


History

* R600 is the chip used in Radeon HD 2000/3000 cards and the FireGL 2007 series
* Introduced the unified shader architecture for PC
* Think of R600 as a massively multicore CPU in which each core also does massive hyper-threading

Source: http://en.wikipedia.org/wiki/Radeon_R600
[Figure: Evergreen block diagram. Labels: Host Application, Compute Driver, System Memory, Instructions and Constants, Inputs and Outputs, Ultra-Threaded Dispatch Processor, Evergreen Device Memory, Return Buffers, Private Data, Memory Controller, L1 Input Caches, L2 Input Cache, Global Data Share, Output Cache]

Source: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf
Concepts

* Command Processor (CP)
  * Ring buffer
  * Indirect buffer 0
  * Indirect buffer 1
* Pipelines
  * Vertices
  * Geometry
  * Fragments
* Wavefronts
  * 64 threads run with the same program counter
  * Control flow contains loop instructions
  * Each thread has its own data and ALU
  * IF/ELSE with execution mask
* Memory access from shaders
  * Vertex fetch (Buffers)
  * Texture fetch (Textures)
  * RAT (available since r800)
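The wavefront idea can be illustrated with a small simulation. This is only a sketch of the concept, not a description of the actual hardware: the function name `run_if_else` and the chosen branch bodies are made up for illustration. All 64 lanes step through both branches, and the execution mask decides which lanes are allowed to write.

```python
# Simplified model of one wavefront: 64 lanes share a program counter,
# each lane has its own data. Illustrative only, not cycle-accurate.
WAVEFRONT_SIZE = 64


def run_if_else(data):
    """Compute `x * 2 if x is even else x + 1` for every lane."""
    assert len(data) == WAVEFRONT_SIZE
    result = list(data)

    # The condition is evaluated on all lanes and becomes the execution mask.
    mask = [x % 2 == 0 for x in data]

    # IF branch: every lane steps through it, only masked-in lanes write.
    for lane in range(WAVEFRONT_SIZE):
        if mask[lane]:
            result[lane] = data[lane] * 2

    # ELSE branch: same instruction stream, inverted mask.
    for lane in range(WAVEFRONT_SIZE):
        if not mask[lane]:
            result[lane] = data[lane] + 1

    return result


out = run_if_else(list(range(WAVEFRONT_SIZE)))
print(out[:4])  # [0, 2, 4, 4]
```

In this model the cost of a divergent IF/ELSE is the sum of both branches, which is why branches that all lanes take the same way are cheaper.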




Program Types

* Vertex Shader
* Geometry Shader
* DMA Copy
* Pixel Shader

New to r800:

* Compute Shaders
* Hull Shader
* Domain Shader
Thread Organization

* "The R600 processor hides memory latency by keeping track of potentially hundreds of threads in different stages of execution, and by overlapping compute operations with memory-access operations." (source: r600isa.pdf)
* Thread state consists of
  * GPRs
  * CRs
  * Temp registers for ALU, VTX and TX clauses
  * Execution mask
Control Flow Programs

* One instruction: 64 bits
* Call ALU clauses (ALU), texture fetch clauses and vertex fetch clauses (VTX)
* Import and export data
* Functions to emit vertices, primitives and such
* Write and read on ring buffers, scratch buffers, reduction buffers, stream buffers
* Loops with LOOP BEGIN, LOOP BREAK, LOOP CONTINUE and LOOP END and a loop count (can be nested)
* PUSH, POP, ELSE, JUMP
  * Manipulate execution mask
  * Execution mask can predicate instruction execution
* JUMPs speed up the program: they can skip instructions when all threads have a certain flag in the execution mask
* Subprograms with CALL and RETURN
* END OF PROGRAM
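How PUSH, ELSE, POP and JUMP cooperate can be sketched with a mask stack. The model below is a deliberate simplification under stated assumptions: the helper `run_branch` and its arguments are hypothetical, the condition is a plain Python callable, and the real instructions carry more state (pop counts, jump addresses) than shown here.

```python
def run_branch(values, cond, then_fn, else_fn):
    """Run `then_fn if cond else else_fn` on every lane using a mask stack.

    Returns the new lane values and the list of clauses actually issued.
    """
    active = [True] * len(values)
    stack = []
    issued = []

    # PUSH: remember the current mask, then restrict it by the condition.
    stack.append(active)
    active = [outer and cond(v) for outer, v in zip(stack[-1], values)]

    # JUMP: if no lane is active, the THEN clause is skipped entirely.
    if any(active):
        issued.append("then")
        values = [then_fn(v) if a else v for a, v in zip(active, values)]

    # ELSE: lanes enabled before the PUSH that failed the condition.
    active = [outer and not a for outer, a in zip(stack[-1], active)]
    if any(active):
        issued.append("else")
        values = [else_fn(v) if a else v for a, v in zip(active, values)]

    # POP: restore the mask that was valid before the branch.
    active = stack.pop()
    return values, issued


uniform = run_branch([1, 2, 3, 4], lambda v: v > 0, lambda v: v * 10, lambda v: -v)
mixed = run_branch([1, 2, 3, 4], lambda v: v % 2 == 0, lambda v: v * 10, lambda v: -v)
print(uniform)  # ([10, 20, 30, 40], ['then'])
print(mixed)    # ([-1, 20, -3, 40], ['then', 'else'])
```

When all lanes agree, only one clause is issued; when they diverge, both are.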
Table 2.7 Flow of a Typical Program

                                                     Microcode Formats
Start loop.                                          CF_DWORD[0,1]
Fetch through a texture cache or vertex cache        TEX_DWORD[0,1,2]
clause to load data from memory to GPRs.
Initiate ALU clause.                                 CF_ALU_DWORD[0,1]
ALU clause to compute on loaded data and literal     ALU_DWORD[0,1]
constants. This example shows a single clause        ALU_DWORD[0,1]
consisting of a single ALU instruction group         ALU_DWORD[0,1]
containing five ALU instructions (two quadwords      ALU_DWORD[0,1]
each) and two quadwords of literal constants.        ALU_DWORD[0,1] LAST bit set
                                                     Literal[X,Y]
                                                     Literal[Z,W]
Allocate space in an output buffer.                  CF_ALLOC_EXPORT_DWORD0
                                                     CF_ALLOC_EXPORT_DWORD1_BUF
Export (write) results from GPRs to output buffer.   CF_ALLOC_EXPORT_DWORD0
                                                     CF_ALLOC_EXPORT_DWORD1_BUF

Source: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf
* Up to 5 slots (X, Y, Z, W, Trans)
* Transcendental (Trans) slot can perform more complex operations
* 0, 2 or 4 literals
* 64 bits per instruction
* Can access 128 GPRs and 256 constants
* Call from CF with ALU or PRED ALU

0000 00000000   CF ADDR:0
0001 84C00000   CF INST:19 COND:0 POP_COUNT:0
0002 00000004   ALU ADDR:8 KCACHE_MODE0:0 KCACHE_BANK0:0 KCACHE_BANK1:0
0003 A01C0000   ALU INST:64 KCACHE_MODE1:0 KCACHE_ADDR0:0 KCACHE_ADDR1:0 COUNT:8
0008 00000001   SRC0(SEL:1 REL:0 CHAN:0 NEG:0) SRC1(SEL:0 REL:0 CHAN:0 NEG:0) LAST:0)
0009 00600C90   INST:25 DST(SEL:3 CHAN:0 REL:0 CLAMP:0) BANK_SWIZZLE:0 SRC0_ABS:0
0010 00000401   SRC0(SEL:1 REL:0 CHAN:1 NEG:0) SRC1(SEL:0 REL:0 CHAN:0 NEG:0) LAST:0)
0011 20600C90   INST:25 DST(SEL:3 CHAN:1 REL:0 CLAMP:0) BANK_SWIZZLE:0 SRC0_ABS:0
0012 00000801   SRC0(SEL:1 REL:0 CHAN:2 NEG:0) SRC1(SEL:0 REL:0 CHAN:0 NEG:0) LAST:0)
0013 40600C90   INST:25 DST(SEL:3 CHAN:2 REL:0 CLAMP:0) BANK_SWIZZLE:0 SRC0_ABS:0
0014 80000C01   SRC0(SEL:1 REL:0 CHAN:3 NEG:0) SRC1(SEL:0 REL:0 CHAN:0 NEG:0) LAST:1)
0015 60600C90 * INST:25 DST(SEL:3 CHAN:3 REL:0 CLAMP:0) BANK_SWIZZLE:0 SRC0_ABS:0
0016 00000002   SRC0(SEL:2 REL:0 CHAN:0 NEG:0) SRC1(SEL:0 REL:0 CHAN:0 NEG:0) LAST:0)
0017 00800C90   INST:25 DST(SEL:4 CHAN:0 REL:0 CLAMP:0) BANK_SWIZZLE:0 SRC0_ABS:0
0018 00000402   SRC0(SEL:2 REL:0 CHAN:1 NEG:0) SRC1(SEL:0 REL:0 CHAN:0 NEG:0) LAST:0)
0019 20800C90   INST:25 DST(SEL:4 CHAN:1 REL:0 CLAMP:0) BANK_SWIZZLE:0 SRC0_ABS:0
0020 00000802   SRC0(SEL:2 REL:0 CHAN:2 NEG:0) SRC1(SEL:0 REL:0 CHAN:0 NEG:0) LAST:0)
0021 40800C90   INST:25 DST(SEL:4 CHAN:2 REL:0 CLAMP:0) BANK_SWIZZLE:0 SRC0_ABS:0
0022 80000C02   SRC0(SEL:2 REL:0 CHAN:3 NEG:0) SRC1(SEL:0 REL:0 CHAN:0 NEG:0) LAST:1)
0023 60800C90 * INST:25 DST(SEL:4 CHAN:3 REL:0 CLAMP:0) BANK_SWIZZLE:0 SRC0_ABS:0
0004 C001A03C   EXPORT GPR:3 ELEM_SIZE:3 ARRAY_BASE:3C TYPE:1
0005 95000688   EXPORT SWIZ_X:0 SWIZ_Y:1 SWIZ_Z:2 SWIZ_W:3 BARRIER:1 INST:84 BURST_COUNT:1 EOP:0
0006 C0024000   EXPORT GPR:4 ELEM_SIZE:3 ARRAY_BASE:0 TYPE:2
0007 95200688   EXPORT SWIZ_X:0 SWIZ_Y:1 SWIZ_Z:2 SWIZ_W:3 BARRIER:1 INST:84 BURST_COUNT:1 EOP:1

[Each INST line continues with the fields SRC1_ABS, WRITE_MASK, OMOD, EXECUTE_MASK and UPDATE_PRED; their values are not legible in the source.]
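A few fields of the control flow words in the dump above can be recovered by hand. The sketch below assumes bit positions that reproduce the values printed in the dump (instruction in bits 29:22, end-of-program in bit 21, barrier in bit 31, export swizzles in the low 12 bits); these positions are an assumption checked only against this dump and should be verified against the AMD documentation. The helper names are made up for illustration.

```python
def bits(word, hi, lo):
    """Extract the inclusive bit range hi:lo from a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_cf_word1(word):
    """Decode a few fields of the second dword of a CF instruction.

    Bit positions are assumptions that match the dump shown above.
    """
    return {
        "inst": bits(word, 29, 22),
        "eop": bits(word, 21, 21),
        "barrier": bits(word, 31, 31),
    }


def decode_export_swizzle(word):
    """Return (SWIZ_X, SWIZ_Y, SWIZ_Z, SWIZ_W), assumed 3 bits each from bit 0."""
    return tuple(bits(word, 3 * i + 2, 3 * i) for i in range(4))


print(decode_cf_word1(0x84C00000))        # {'inst': 19, 'eop': 0, 'barrier': 1}
print(decode_cf_word1(0x95200688))        # {'inst': 84, 'eop': 1, 'barrier': 1}
print(decode_export_swizzle(0x95000688))  # (0, 1, 2, 3)
```

The decoded values agree with the INST, EOP, BARRIER and SWIZ fields printed next to the same words in the dump.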
* Source selection flags
  * Read from GPR
  * Read from constant bank
  * Read a previous result
  * Load a literal constant
  * Load float 0.0, 0.5 or 1.0
  * Load integer -1, 0, 1
  * Set ABS and/or NEG bit
* Destination GPR
* Instruction
* CLAMP bits
* Output modifier

[Figure 4-3. ALU Data Flow. Labels: Constant 2, Constant 3, ALU.X, ALU.Trans]

Source: http://x.org/docs/AMD/r600isa.pdf
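The effect of the ABS/NEG source bits, the output modifier and the CLAMP bit can be sketched as plain arithmetic. Several points here are assumptions rather than facts from the slides: that ABS is applied before NEG, that the output modifier scales by 2, 4 or 1/2, that it is applied before clamping, and that CLAMP limits the result to [0.0, 1.0]. The function and key names are hypothetical.

```python
# Assumed output modifier factors; the names are illustrative labels only.
OMOD_SCALE = {"off": 1.0, "mul2": 2.0, "mul4": 4.0, "div2": 0.5}


def read_source(value, use_abs=False, use_neg=False):
    """Apply the per-source modifiers (assumed order: ABS first, then NEG)."""
    if use_abs:
        value = abs(value)
    if use_neg:
        value = -value
    return value


def write_result(value, omod="off", clamp=False):
    """Apply the output modifier, then clamp to [0.0, 1.0] if requested."""
    value *= OMOD_SCALE[omod]
    if clamp:
        value = min(max(value, 0.0), 1.0)
    return value


# MUL with |src0|, output scaled by 2 and clamped.
src0 = read_source(-3.0, use_abs=True)
src1 = read_source(0.25)
unclamped = write_result(src0 * src1, omod="mul2")
clamped = write_result(src0 * src1, omod="mul2", clamp=True)
print(unclamped, clamped)  # 1.5 1.0
```

Folding such modifiers into the instruction saves separate ABS, NEG, scale or saturate instructions.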
* ADD_INT, AND_INT, MUL, MUL_IEEE, MULADD, MULADD_D2, MULADD_D4, MULADD_IEEE_D2
* MOV, CMOV??_INT, PRED_SET??_INT, SET??_INT, CMOV??, PRED_SET??, SET??
* MIN, MAX, TRUNC, CEIL

Restricted to XYZW (no Trans)

* DOT4, DOT4_IEEE, MAX4

Restricted to Trans unit

* ASHR_INT, INT_TO_FLT, MULLO_INT, MULHI_INT, RECIP_UINT
* SIN, COS, EXP_IEEE, LOG_CLAMPED, LOG_IEEE
* RECIP_IEEE, RECIP_FF, RECIP_CLAMPED
* MUL_LIT_D2, MUL_LIT_D4
* RECIPSQRT_CLAMPED, RECIPSQRT_FF, RECIPSQRT_IEEE
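The slot restrictions listed above can be turned into a small lookup that a code generator might consult. This is a sketch built only from the names on this slide; it assumes that the instructions in the first, unlabeled group may run in any slot, and the function name `allowed_slots` is hypothetical.

```python
VECTOR_SLOTS = ("X", "Y", "Z", "W")
ALL_SLOTS = VECTOR_SLOTS + ("Trans",)

# Taken from the "Restricted to XYZW (no Trans)" list.
VECTOR_ONLY = {"DOT4", "DOT4_IEEE", "MAX4"}

# Taken from the "Restricted to Trans unit" list.
TRANS_ONLY = {
    "ASHR_INT", "INT_TO_FLT", "MULLO_INT", "MULHI_INT", "RECIP_UINT",
    "SIN", "COS", "EXP_IEEE", "LOG_CLAMPED", "LOG_IEEE",
    "RECIP_IEEE", "RECIP_FF", "RECIP_CLAMPED",
    "MUL_LIT_D2", "MUL_LIT_D4",
    "RECIPSQRT_CLAMPED", "RECIPSQRT_FF", "RECIPSQRT_IEEE",
}


def allowed_slots(op):
    """Return the slots an instruction may occupy.

    Anything not on a restricted list is assumed to fit every slot.
    """
    if op in VECTOR_ONLY:
        return VECTOR_SLOTS
    if op in TRANS_ONLY:
        return ("Trans",)
    return ALL_SLOTS


for name in ("MOV", "DOT4", "SIN"):
    print(name, allowed_slots(name))
```

Since only one Trans slot exists per instruction group, a program heavy in Trans-only instructions cannot fill the other four slots with them.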
[Figure: microcode word layout with byte offsets +12, +8, +4, +0]

Source: http://x.org/docs/AMD/r600isa.pdf

0000 00000002   TEX/VTX ADDR:4
0001 80800400   TEX/VTX INST:2 COUNT:2
0004 7C000000   INST:0 FETCH_TYPE:0 BUFFER_ID:0 SRC(GPR:0 SEL_X:0) MEGA_FETCH_COUNT:31
0005 AC151001   DST(GPR:1 SEL_X:0 SEL_Y:1 SEL_Z:2 SEL_W:5) USE_CONST_FIELDS:0 FORMAT(DATA:48 NUM:2 COMP:0 MODE:1)
0006 00080000   ENDIAN:0 OFFSET:0
0007 00000000
0008 7C000000   INST:0 FETCH_TYPE:0 BUFFER_ID:0 SRC(GPR:0 SEL_X:0) MEGA_FETCH_COUNT:31
0009 AC151002   DST(GPR:2 SEL_X:0 SEL_Y:1 SEL_Z:2 SEL_W:5) USE_CONST_FIELDS:0 FORMAT(DATA:48 NUM:2 COMP:0 MODE:1)
0010 0008000C   ENDIAN:0 OFFSET:12
0011 00000000
0002 00000000   CF ADDR:0
0003 85000000   CF INST:20 COND:0 POP_COUNT:0



Consequences for the compiler developer

* CF
  * Turn if/else instructions into execution mask operations
  * Turn while, do..loop and jumps into LOOP
  * Find instructions to skip with JUMP
* ALU
  * Try to fill all 5 ALU slots
  * Observe all restrictions
  * Vectorize 4 threads into one
* Memory
  * Find the right (= fastest) buffer type
  * Write cache-friendly programs
  * Safe memory accessing instructions
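Filling the five ALU slots is essentially a packing problem. The greedy packer below is a minimal sketch under simplifying assumptions, not the algorithm of any real compiler: operations are taken in program order, an operation may not join a group that already writes one of its sources or its destination, and the only slot restriction modeled is a hypothetical `trans_only` flag.

```python
from dataclasses import dataclass

VECTOR_SLOTS = ["X", "Y", "Z", "W"]


@dataclass
class Op:
    name: str
    dst: str
    srcs: tuple
    trans_only: bool = False  # e.g. SIN or COS, which need the Trans slot


def pick_slot(op, group):
    """Return the first free slot the op may use, or None if there is none."""
    candidates = ["T"] if op.trans_only else VECTOR_SLOTS + ["T"]
    for slot in candidates:
        if slot not in group:
            return slot
    return None


def pack(ops):
    """Greedily pack ops, in program order, into groups of up to 5 slots."""
    groups = []
    current = {}     # slot name -> Op for the group being built
    written = set()  # registers written by the current group
    for op in ops:
        slot = pick_slot(op, current)
        depends = op.dst in written or any(s in written for s in op.srcs)
        if slot is None or depends:
            # Close the group; the op starts a new one.
            groups.append(current)
            current, written = {}, set()
            slot = pick_slot(op, current)
        current[slot] = op
        written.add(op.dst)
    if current:
        groups.append(current)
    return groups


program = [
    Op("MOV", "r1", ("r0",)),
    Op("MOV", "r2", ("r0",)),
    Op("SIN", "r3", ("r0",), trans_only=True),
    Op("MUL", "r4", ("r1", "r2")),  # needs r1 and r2 from the first group
    Op("ADD", "r5", ("r0", "r0")),
]
groups = pack(program)
print([sorted(g) for g in groups])  # [['T', 'X', 'Y'], ['X', 'Y']]
```

Here five operations need two groups, and seven of the ten available slots stay empty or shared, which is the kind of underutilization a compiler tries to reduce by reordering independent work.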
http://x.org/docs/AMD/r600isa.pdf

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf